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1.  Introduction 


The  object  of  this  paper  is  to  describe  fast  parallel  algo¬ 
rithms  for  multiplying  matrices  of  arbitrary  sizes  having  inte¬ 
ger  elements  of  arbitrary  magnitude.  These  algorithms  are 
based  on  the  earlier  work  of  Klette  [3]  but  attempts  are  made 
to  improve  them  by  the  use  of  modular  arithmetic  for  computa¬ 
tion.  The  use  of  modular  arithmetic  in  effect  improves  the  algo¬ 
rithms  in  [3]  when  the  matrix  product  has  elements  which  are 
very  high  precision  numbers  of  the  order  of  2e  (or  e  bits) .  It 
is  shown  that  if  two  matrices  of  size  (mxm)  are  multiplied,  and 
we  choose  r  prime  numbers  k^  each  of  precision  ei  bits,  the 
time  taken  is  as  follows: 

Single-instruction-multiple-data  algorithm 

2  r  2 
0(re  +  m(e  log  m  +  Z  e. ) )  , 

i-1  1 

Multiple-instruction-multiple-data  algorithm 
O(log  r  log  e  +  (log  em) (max  log  ei>).  1 

Since  matrix  multiplication  is  an  important  operation  for 
realizing  matrix  transformations  and  inversions  of  matrices 
[7,12]  it  is  believed  that  the  present  approach  will  have  prac¬ 
tical  utility. 

The  organization  of  this  paper  is  as  follows:  Section  2 
describes  the  basic  principle  of  the  parallel  matrix  multiplication 

^Logarit>  as  are  always  taken  to  the  base  2. 


algorithms;  Section  3  explains  the  design  features  of  a 
parallel  processor  for  bit-wise  computation;  the  time  com¬ 
plexity  of  the  various  possible  schemes  are  discussed  in 
Section  4;  the  use  of  fast  multiplication  schemes  for  further 
speedup  is  considered  in  Section  5;  finally,  the  speedup  that 
can  result  by  the  use  of  modular  arithmetic  is  described  in 


Section  6 


2.  Principle  and  definitions 


We  denote  the  two  matrices  A  =  (a^)  and  ®  *  (b^)  of  size 
(kxm)  and  (mxn)  respectively.  Our  aim  is  to  compute  A*  8  of 
size  kxn  with  elements 


p^*ipbp5  £or  1  *  1,2 . k'  3  ’  1<2 . "  ' 

By  'parallel'  we  mean  that  the  operations  transform  the  operand 
matrix  as  if  all  the  matrix  elements  were  processed  simultaneously. 
For  the  practical  realization  of  such  a  scheme  we  can  think  of 
fiber  optics  [10]  or  CCD  techniques  [8]. 

We  now  describe  the  principal  idea  of  our  algorithm: 

Let  C[p]  denote  a  (kxn)  matrix  in  which  all  the  columns  are 
identical  with  the  p-th  column  of  the  matrix  A;  let  R[p]  denote 
a  (kxn)  matrix  in  which  all  the  rows  are  identical  with  the  p-th 
row  of  the  matrix  8.  For  two  matrices  C  and  R  of  the  same  size 


kxn,  let 

C*R  »  (cijrij)i  =  1,2 . k,  j  =  1,2, ...,n 

be  the  modified  product  of  these  two  matrices  C  and  R.  It  im¬ 
mediately  follows  that 
m 

Lemma:  A*8  »  E  C[p]*R[p]. 

P-1 

We  illustrate  this  lemma  by  an  example  in  Figure  1. 

This  lemma  can  be  used  for  parallel  computation  of  A* 8, 
either  in  SIMD  (single-instruction-multiple-data)  mode  or  MIMD 


(multiple-instruction-multiple-data)  mode,  using  Flynn's  [1] 
notation.  In  writing  these  algorithms  we  use  an  Algol-like 
notation  for  brevity. 

(1)  SIMD  mode. 

Algorithm  PS 

begin  matrices  A,B,C,R,A*8; 
integer  m,p; 

initialize  A* 8  identical  0; 
for  p  »  1  step  1  until  m  do 

C  *  C[p]  using  A,  and  R  =  R[p]  using  B; 

C  *  C*R; 

A* 8  =  A-B+C 

od; 

end 

Remark :  This  algorithm  involves  m  steps  of  computation,  each 
with  one  construction  of  C[p]  and  R[p],  one  modified  matrix  pro¬ 
duct,  and  one  matrix  addition.  If  we  regard  these  as  primitive 
operations,  we  obtain  A«8  in  SIMD  mode  in  0(m)  operations.  This 
algorithm  was  described  in  Klette  [3]. 

(2)  MIMD  mode. 

For  realization  in  this  mode,  it  is  assumed  that  m  is  an  in¬ 


tegral  power  of  2;  where  it  is  not  we  assume  log  m  is  replaced  by 
its  higher  integral  part  in  the  algorithm  described  below. 


Algorithm  PM 


1  step  1  until  m  do  in  parallel 


for  i  =  1  step  1  until  log  m  do 


1  step  2  until  m-1  do  in  parallel 


construction  of  C [p]  and  R[p]  and  one  modified  matrix  product 

for  p  =  l,2,...,m  in  parallel.  Then  we  compute  in  log  m  steps 

m 

of  parallel  matrix  addition  the  desired  sum  E  C[p]  =  A* 8  in 

.  P=1 

the  matrix  field  C [ 1 ] .  Assuming  that  these  are  primitive  in¬ 
structions  for  a  certain  MIMD  computer,  A*B  can  be  computed  ir 


3.  Parallel  processor  for  bit-wise  computation 

We  assume  that  the  elements  of  the  matrices  A  and  8  are 
integers  from  the  set  0,1,2, ...,2e-l  for  e*l.  (Note  that 
this  does  not  restrict  the  choice  of  negative  numbers  as  ele¬ 
ments  of  A  and  8,  as  complement  notation  can  be  used.)  Then 
each  matrix  A  (or  6)  can  be  encoded  in  a  bijective  manner 
through  a  sequence  of  e  Boolean  matrices  Ae_i» Ae_2 ' ' ‘ ' A0  of 
the  same  dimensions  as  A,  where 

Ae-l(i'^)  Ae-2 (i ' j }  A0(i,j) 

is  the  positional  binary  representation  of  the  elements ,  viz . , 
A(i,j)  =  a^...  In  other  words,  the  matrices  Ag_1 , Ae_2 *  •  •  •  •  Ao 
represent  the  bit  planes  of  the  matrix  A.  For  illustration,  an 
example  is  provided  below: 

Let  A  =  ;  then 

A4A3A2A1A0  =  [o  o]  [l  o]  [l  l]  [o  l]  [o  l]  * 

It  is  clear  that  using  such  a  positional-additive  decompo¬ 
sition,  the  matrix  operations  (such  as  add,  multiply)  can  be 
realized  using  Boolean  operations  and  shifts  on  sequences  of 
such  Boolean  matrices. 

(1)  Addition  in  SIMD  mode. 

Let  Ae_i,Ae_2' • • • »A0  and  Be_i'Be-2' • *  * 'B0  be  binar^ 
matrix  sequences  of  same  dimensions.  Then  the  following  algo¬ 
rithm  (3)  performs  the  matrix  addition  A+B  using  Boolean 


operations  on  the  binary  matrices  (bit-wise  parallel)  as  basic 
steps  in  SIMD  mode. 

Algorithm  AS 

begin  binary  matrices  AQ , . . . ,  Ae_1#  BQ , . . .  ,Be_1# CQ  , . . . /C^^C 

D,  CARRY; 

integer  e , i ; 

CQ  =  Aq®B0;  CARRY  =  Aq ABq ;  i=l; 

while  i<e  do 

D  =  Ai®Bi;  Ci  =  D®CARRY ; 

CARRY  =  (A. AB . ) V (CARRYAD) ; 

X  X  • 

i  =  i+1; 

od; 

Ce  =  CARRY; 

end 

Remark;  The  sequence  of  the  outputs  C  ,C  C_  represents 

the  result  A+ 8.  Using  bitwise  parallel  Boolean  operations  on 
binary  matrices  this  algorithm  in  SIMD  mode  has  time  complexity 
0(e). 

Note  that  the  basic  idea  of  this  matrix  addition  algorithm 
is  similar  to  adding  two  e-bit  numbers  in  sequential  mode. 
Figure  2  illustrates  this  algorithm. 

(2)  Addition  in  MIMD  mode. 


In  [9]  a  SIMD  algorithm  has  been  described  for  adding  e-bit 
numbers  in  parallel  in  0(log  e)  steps,  using  parallel-bitwise 
Boolean  operations  and  shifts  in  vector  registers.  Using  this 


Figure  2.  Illustration  of  Algorithm  AS 


IMMI 


■.J/t ii  a\» 


SIMD  algorithm  we  now  give  an  algorithm  in  MIMD  mode  for  adding 
binary  matrices  Ae_1,...,A()  and  in  O(log  e)  time. 

This  algorithm  is  most  efficient  when  e  is  a  power  of  2;  where 
it  is  not  we  assume  log  e  is  replaced  by  its  higher  integral 
part  in  the  algorithm  described  below. 


ALGORITHM  AM 


begin  binary  matrices  A0»  •  •  • /Vl'B0' *  *  *  ,Be-l,C0' ’  *  * 'Ce-l'Ce'D0' *  *  ’ ' 
integer  e , i , j ; 

for  i  =  0  step  1  until  e-1  do  in  parallel 


end 


Ci  =  D.  =  A.VB.,- 


od; 


for  i  =  0  step  1  until  log  e-1  do 

for  j  =  2X  step  1  until  e-1  do  in  parallel 

Cj  =  CJV(CJ-2i  A  V: 

Dj  ’  DjADj-2l! 

od; 

od; 

C  =  0o  , ; 
e  e-l 

for  i  =  1  step  1  until  e-1  do  in  parallel 


C.  =  C .  .  ®  A.  ®  B . ; 

l  l-l  i  l 


od; 


C0  =  VBo; 


Remark :  T’  -j  sequence  of  the  outputs  C  ,C  Cn  represents 

©  6"  x  0 


the  result  A+8.  This  algorithm  in  MIMD  mode  has  O(log  e)  time 


complexity.  In  Figure  3  we  illustrate  this  algorithm. 

(3)  Generation  of  C[p]  and  R[p]  in  SIMD  mode. 

The  processes  of  generating  C [p] ,  p  ■  l,2,...,m  and 
R[p],  P  =  l,2,...,m  are  important  operations  in  our  algorithm. 

Let  A  =  Ae_i»  Ae_2**  *  ,,Ao  ke  the  matrix  of  size  mxm  where  m  is  a 
power  of  2.  The  following  SIMD  algorithm  computes  C  .  ,C  ~...Cn 
corresponding  to  C [ 1 ]  (see  [3]).  The  algorithm  uses  power-of- 
two  shifts  of  Boolean  matrices;  such  shifts  are  denoted  by  •<-21 
(left}  -*-21  (right)  ,  t21(up)/  4-21(down)  where  the  shift  distance 
is  21.  While  performing  such  two-dimensional  shifts,  the  vacated 
columns  or  rows  are  filled  with  zeros,  and  the  shifted-out 
columns  or  rows  are  discarded. 

Algorithm  CS 

beqin  binary  matrices  AM,...,A  , ,C C  , ,M; 

-  u  e— x  o  e— x 

integer  e ,  i ,  j  ; 

in  M  the  leftmost  column  is  identically  1,  otherwise  M  is  0; 
i  =  0; 

while  i<e  do 

=  A^AM;  j=0; 
while  j<log  m  do 

Ci  =  CiV(Ci^2j) ;  j  =  j+1; 
od; 

i  =  i+lj 
od 

end 
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Figure  3.  Illustration  of  algorithm  AM 


Remark :  Using  bitwise  parallel  Boolean  operations  and  power-of- 


two  shifts  on  binary  matrices  this  algorithm  has  time  complexity 
0(e  log  m) .  The  principle  of  this  algorithm  is  easy  to  under¬ 
stand,  demonstrating  the  utility  of  power-of-two  shifts  using 
a  divide-and-conquer  approach.  The  construction  of  matrices 
C  [2]  , . . . ,  C  [m]  ,  R  [1]  , . . . ,  R  [m]  can  be  performed  within  the  same 
time,  using  an  analogous  algorithm. 

(4)  Generation  of  C[p]  and  R[p]  in  MIMD  mode. 

The  construction  of  C[l]  can  be  done  within  time  O(log  m) 
using  the  following  algorithm  in  MIMD  mode. 

Algorithm  CM 

Change  i-loop  in  algorithm  CS  into  a  parallel  instruction. 

Remark:  The  generation  of  the  matrices  C[l], _ , C[m],R[l], _ _ 

R[m]  can  be  done  within  time  O(log  m)  in  MIMD  mode. 

(5)  Modified  matrix  product  in  SIMD  mode. 

In  what  follows  we  shall  assume  that  bit-wise  parallel 
Boolean  operations  and  power-of-two  shifts  on  Boolean  matrices 
are  basic  operations  in  our  computational  model.  We  will  call 
this  model  a  Parallel  Binary  Matrix  Processing  System  (PBS) . 

(For  a  more  detailed  description  of  this  model  see  [3,4,5].) 

In  this  model  the  size  of  the  matrix  registers  is  restricted  to 
the  original  size  of  the  input  matrices.  In  Pratt  et  al.  [9], 
however,  a  similar  model  for  computation  is  described  (with 
restrictions  on  one-dimensional  registers)  with  unrestricted 
registers.  It  is  our  view  that  PBS  offers  a  more  realistic  model 


for  computation,  since  programs  in  the  PBS  model  take  linear 
space  while  programs  in  Pratt/S tockmeyer '  s  model  [9]  use  expo¬ 
nential  space  for  problems  such  as  matrix  multiplication. 


The  essential  advantage  of  PBS  for  computing  the  modified 

product  is  that  well-known  techniques  used  for  computations  with 

binary  numbers  can  be  translated  directly  to  the  case  of  binary- 

2 

matrix  computations.  In  fact,  using  the  usual  0(e  )  algorithm 

for  multiplying  e-bit  numbers,  one  can  realize  the  modified 

matrix  product  of  two  matrices  Ag^A^,  •  •  •  »A0  and  Be-l'Be-2'  * '  * ' 

2 

Bq  in  time  complexity  0(e)  in  SIMD  mode. 

Algorithm  MS 

bs^in^insr^jii&trics^  Aq , . .  • , i'Bo'  *  *  *  ,Be“l,^0  /  *  *  * 2e—  i^o 1 

. . • 'Ee_i'Do ' • • * ,De-l'De; 

integer  e, i, j ; 

C0  =  W 

for  i  =  0  step  1  until  e-2  do  D^»A^+^ABQ;  od; 

D  .  identical  0; 
e— l 

for  j=0  step  1  until  e-1  do 

for  i=0  step  1  until  e-1  do  Ei*A^AB ^ ;  od; 

addition  of  D  ,D  -...Drt  and  E  . E  -...E-  using 
e— x  e— 2  o  e— i  e— 2  u 

algorithm  AS,  resulting  in  DeDe_i***Dg' 

W 

for  i*0  step  1  until  e-1  do  Dj«D^+^;  od? 


for  i*0  step  1  until  e-1  do  ce+i*Di  *  od > 


Remark:  The  sequence  C2e-l'C2e-2* '  ,Co  rePresents  t^ie  modi 

2 

product  A*8.  This  program  runs  within  time  0(e  ). 

(6)  Modified  matrix  product  in  MIMD  mode. 

In  Pratt  et  al.  [9],  a  SIMD  algorithm  can  be  found  for 


the  parallel  realization  of  the  multiplication  of  two  e-bit 
numbers  in  O(log  e)  steps  using  parallel  bitwise  Boolean  opera¬ 
tions  and  shifts  on  vector  registers.  Using  this  same  basic 
idea  we  can  give  an  algorithm  in  MIMD  mode  for  computing  the  modi 

fied  multiplication  of  matrices  A  .A  _...Art  and  B  .B 

e— l  e—  2  0  e— i  e— 2.  o 

running  in  time  O(log  e) .  This  translation  can  be  done  in  a 
manner  similar  to  the  method  used  in  algorithm  AM:  One  parallel 
operation  on  vector  registers  can  be  translated  into  one  parallel 


instruction  on  sequences  of  Boolean  matrices.  Unfortunately, 

the  resulting  MIMD  algorithm  for  PBS  requires  a  very  large  number 

2 

of  matrix  registers  (0 (m  ) ) . 


Algorithm  MM 


Translation  of  the  SIMD  algorithm  for  multiplication  of 
binary  numbers  in  Pratt  et  al.  [9]  into  MIMD  program  for  PBS. 
With  this  algorithm  all  operations  used  in  the  algorithms  PS  and 


PM  are  implemented  on  PBS,  either  in  SIMD  mode  or  in  MIMD  mode 


4.  Time  complexit' 


In  Section  2,  we  described  two  basic  algorithms  for  com¬ 
puting  the  matrix  product,  one  in  SIMD  mode  and  one  in  MIMD 
mode.  In  Section  3  we  described  two  ways  to  implement  these 
algorithms  on  PBS  in  SIMD  and  MIMD  modes.  Thus  altogether  we 
have  at  least  four  different  ways  to  compute  the  matrix  products 
in  parallel;  we  will  now  summarize  the  time  complexity  for  these 
four  cases  assuming  that  A  and  8  are  mxm  matrices  (with  m  a 
power  of  two)  and  that  the  elements  in  A  and  8  are  from  the  set 
{0,1,2, ...,2e-l}. 

Case  i:  The  general  SIMD  algorithm  with  SIMD  implementation  on 
PBS  has  time  complexity  0(m  e  (log  m+  e) )  . 

Case  ii:  The  general  SIMD  algorithm  with  MIMD  implementation  on 
PBS  has  time  complexity  O(m(log  m+log  e) ) . 

Case  iii;  The  general  MIMD  algorithm  with  SIMD  implementation 
on  PBS  has  time  complexity  0(e(log  m+e) ) . 

Case  iv:  The  general  MIMD  algorithm  with  MIMD  implementation  on 
PBS  has  time  complexity  O(log  m  log  e) . 

In  our  opinion.  Case  i  is  presently  a  realistic  mode  for 
technical  implementation  (see  [10]).  In  what  follows  we  refer 
to  Case  i  as  the  single-instruction-multiple-data  algorithm.  In 
Section  6  we  will  use  these  two  approaches  and  discuss  ways  to 
improve  the  speed  by  using  residue  or  modular  arithmetic  and 
other  procedures;  see  Knuth  [6]  and  Krishnamurthy  et  al.  [7]. 


Since  in  practical  numerical  analysis  of  matrix  operations, 
the  precision  of  the  result  is  of  considerable  importance, 

2 

it  xs  necessary  to  examine  whether  the  time  complexity  0(e) 
required  for  SIMD  implementation  (Cases  i  and  iii)  and  0(log  e) 
for  MIMD  implementation  (Cases  ii  and  iv)  can  be  reduced,  assum¬ 
ing  m  to  be  a  constant. 


5 .  Use  of  fast  multiplication  schemes 

2 

It  is  well  known  (Knuth  [6])  that  the  order-e  method  is 

not  the  quickest  way  to  multiply  e-bit  numbers.  For  example 

Karatsuba's  [2]  method  involves  only  elog3=e1*5^  steps  using 

the  conventional  sequential  implementation.  This  method  leads 

to  a  SIMD  algorithm  for  the  modified  matrix  product  on  PBS 

1  59 

running  within  the  same  time  0(e  *  ).  We  will  call  this 

algorithm  MSK. 

Algorithm  MSK 

Translation  of  Karatsuba's  method  into  a  SIMD  algorithm 
for  PBS  for  the  modified  product  A* 8. 

Remarks :  For  a  detailed  description  of  this  and  other  methods 

for  multiplying  e-bit  numbers,  see  Knuth  [61.  In  our  view  algo¬ 
rithm  MSK  is  suitable  for  practical  realization  and  so  we  will  use 
this  for  our  purposes.  For  Cases  (i)  and  (iii)  described  in 

Section  4,  this  algorithm  leads  to  the  following  improvements: 

1  59 

Case  i:  Time  Complexity  0 (m  e  log  m  +  m  e  ) . 
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Case  iin  Time  Complexity  0(e  log  m  +  e  ). 

In  the  next  section  we  will  discuss  the  application  of 


residue  arithmetic  and  consider  the  time  complexities  for  the 
SIMD  and  MIMD  algorithms. 


i 


n-  -E3 


Let  k  =  30  =  2-3-5;  k]L  =  2,  k2  =  3,  k3  =*  5. 


Step  1;  Matrices  A  and  B  are  encoded  modulo  k^k^k-j. 


A (mod  2) 


8 (mod  2 )  = 


■E3 

-K3 


A  (mod 


8 (mod  3 ) 


-E3 

”-E3 


A (mod  5)  = 


8 (mod  5)  = 


E:] 

E3 


Step  2:  Product  of  A (mod  k^)  and  8 (mod  k^)  modulo  k.,  . 


0  O' 
_0  1 


2  1 
0  0_ 


r42i 

\j  ij 


Step  3:  Product  of  the  result  in  Step  2  with  k/k^  for  1  *  i  *  3, 


0  O' 
0  15_ 


20  10" 
_0  0_ 


"24  12“ 
12  6_ 


Step  4:  Addition  of  the  results  in  Step  3  modulo  k. 


14  221  =  A  •  1 
_12  21 


Figure  4.  Modular  approach  to  matrix  product. 


Szabo  et  al.  [11],  Knuth  [6],  Krishnamurthy  et  al.  [7],  Young 
et  al.  [12].  Here  we  will  use  a  straightforward  multiplicative- 
additive  approach,  in  which  each  element  of  the  product 
A(mod  kj)*B(mod  k^)  mod  k^  is  multiplied  by  (k/kj)^kj*  (where 
denotes  Euler ' s  totient  function)  for  1  *  j  *  r  and  summed 
modulo  k,  as  illustrated  in  Figure  4. 

An  important  aspect  in  using  modular  arithmetic  is  the 
choice  of  k ^ ' s  suitable  for  practical  implementation.  It  seems 
convenient  from  a  practical  design  viewpoint  that  each  k^  be  a 
prime  of  the  form  (2ej-l) ,  since  this  would  then  correspond  to 
bitwise  computation  on  PBS.  (As  an  example  e^  =  2, 3, 5, 7). 

Since  we  are  interested  in  implementing  the  modular  approach 
on  PBS,  let  us  describe  the  general  algorithms  for  implementation 
in  SIMD  and  MIMD  modes. 

Let  A^Afmod  k^) ,  8i=8(mod  k^)  for  1  *  i  *  r. 

(1)  SIMD  mode. 

Algorithm  PMS 

begin  matrices  A, 8, A* , 8* , C . R. V,  A*8; 
integer  m,r,k,klf . . . ,k  ,i, j ; 
initialize  A* 8  identical  0; 
for  i  =  1  step  1  until  r  do 

production  of  A*=Ai  and  8*=8i  using  A  and  B 
respectively; 

for  j  =  1  step  1  until  m  do 
initialize  V  identical  0; 


C  =  C[j]  using  A*  and  R  =  R[j]  using  B*, 


C  =  C*R  mod  k^; 

V  =  P+C  mod  k^; 
od; 

P  =  k/ki-P, 

A* 8  =  A-B+V  mod  k; 

od; 

end 

Remark ;  This  algorithm  has  time  complexity  0(rm),  regarding 
the  operations  used  as  primitive  operations. 

(2)  MIMD  mode. 

For  this  purpose  we  assume  that  r  is  a  power  of  2;  where 
it  is  not  we  assume  log  r  is  replaced  by  its  higher  integral 
part  in  the  algorithm  described  below. 

Algorithm  PMM 

bec[_in_jnat£ice£  A, A^# . . • 9 A  ^  f  8^ . . . f  8^/C^[l] t  . . . /  C  ^  [  m  ] t 
R 1  [  1  ]  ,  .  .  . ,  Rr  [m]  ; 
integer  m,r,k,k^,...,kr>i,j,h; 
for  i  =  1  step  1  until  r  do  in  parallel 

production  of  A^  and  8^  using  A  and  8  respectively; 
for  j»l  step  1  until  m  do  in  parallel 

compute  C^tjlfR^tj]  using  A.^,8^  respectively; 

Ci  Ij]“Ci  Ijl*Ri [jl  mod  ki; 

od; 


for  j =1  step  1  until  log  m  do 

for  h=l  step  2?  until  m-1  do  in  parallel 
Ci[h].-ei[h]+Ci[h+2j"1]  mod  k^- 

od; 

od; 

CjL[l]=k/ki*Ci[ll; 

od; 

for  i=l  step  1  until  log  r  do 

for  j=l  step  21  until  r-1  do  in  parallel 

C . [1]=C  .  [1 ]+C  .  , [1]  mod  k; 

3  3  j+21_x 

od; 

od; 

end 

Remark;  This  algorithm  has  time  complexity  O(log  m  +  log  r) , 
regarding  the  operations  used  as  primitive  operations.  The  out¬ 
put  A-8  is  computed  in  matrix  register  C^[l]. 

Implementation  in  PBS 

We  first  consider  the  addition  and  the  modified  product 
modulo  kh,  1  *  h  s  r.  According  to  the  well-known  techniques 
for  modular  arithmetic  (Knuth  [6]),  for  matrices  F  and  G  with 

elements  f. .  ,g. .  respectively,  each  with  e^-bit  length  (with 
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bit  positions  e^-1, e^-2 , . . . , 0 ,  the  sum  F+G  modulo  k^=2  -1  is 

easily  obtained  thus: 


ing  to  bit  position  e^, is  added  to  the  bit  planes  e^-1,  e^-2,.,.,0. 
In  other  words,  the  addition  modulo  k^  needs  twice  the  time  of 
addition  of  two  e^-bit  matrices,  i.e.  0(eh)  in  SIMD  mode  if  we 
use  algorithm  AS,  and  O(log  e^)  in  MIMD  mode  if  we  use  algorithm 


AM. 


For  obtaining  the  elements  of  the  modified  product  F*G 
(mod  k.  )  ,  the  rule  is: 

h  e  e 

fij*gij(mod  kh)  =  (fij‘gij  mod  2  h)  +  lfij'gij/2  h^mod  • 
To  do  this  calculation,  we  perform  the  modified  product  of  two 

eh~bit  matrices  F  and  G;  then  we  perform  an  addition  modulo  kh 

of  the  first  eh+l  bit  planes  and  the  last  eh  bit  planes.  Using 

the  algorithms  MS  and  AS  we  can  perform  this  on  PBS  in  time 
2 

0(eh)  in  SIMD  mode.  Using  algorithms  MM  and  AM,  the  modified 

product  can  be  computed  in  O(log  e^)  steps  in  MIMD  mode. 

Now  consider  the  construction  of  the  matrices  A^  (A  mod  k^) 

and  8^  (8  mod  k^) .  For  this  purpose,  we  can  group  together 

blocks  of  e^  bit  planes  of  a  given  matrix  A.  Let  these  blocks 
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of  bit  planes  be  denoted  by  the  matrices  A  ,A  ,...,A  .  Then 

we  perform  the  addition 

.  -  e . 

A1  +AZ  +  ...  +  Afc  mod  k^  (k^=2  1-1) 


(see  Knuth  [6]).  For  this,  in  SIMD  mode  we  need  0(e^(t-l)) 

=  0(e.  •  i22_JS)  s  o(e)  steps  on  PBS  if  we  assume  k  =  2e,  and 

1  1  e 

in  MIMD  mode  we  need  0(log  e.  log  t)  *  0(log  e.  log  — )  = 

2  1  ®i 

0(log  e^  log  e  -  (log  e^)  )  steps  on  PBS. 

Finally  we  consider  the  operation  of  matrix  addition  modulo 
k.  In  the  algorithms  PMS  and  PMM  we  have  the  operations 
A’ 8  =  A*B+P  mod  k,  and 

C.  [1]  =  C.  [1]  +  C  .  [1]  mod  k. 

J  3  j+21_i 

To  explain  this,  at  each  step  of  the  operation,  we  first  perform 
the  (usual)  addition  and  then  we  subtract  from  each  element  in 
parallel  the  value  k;  if  the  result  is  negative  we  retain  the 
element  value,  otherwise  we  retain  the  subtracted  result  as  the 
new  value  of  the  element.  This  is  because  of  the  fact  that  each 
element  mod  k^  when  multiplied  by  k/k^  can  be  in  the  range 
0,1,..., k-1  and  so  when  the  elements  are  added  modulo  k  the 
result  can  lie  in  the  range  0,1, . . . , 2k-2.  For  implementation 
on  PBS,  note  that  in  the  resulting  matrix  after  the  subtraction, 
an  overflow  appears  in  the  most  significant  bit  plane  in  all 
places  where  the  value  after  the  addition  was  greater  than  k, 
and  vice  versa.  We  use  this  most  significant  bit  plane  as  a  mask 
for  the  decision  between  the  original  value  and  the  subtracted 
value,  for  all  log  k  bit  planes.  Altogether,  the  matrix  addition 
modulo  k  used  in  the  algorithms  PMS  and  PMM  can  be  performed  on 
PBS  within  time  0(log  k)  =  0(e)  in  SIMD  mode,  and  within  time 
0(log  e)  in  MIMD  mode. 


We  will  now  summarize  the  time  complexity  for  the  SIMD 
and  MIMD  algorithms  using  modular  arithmetic.  For  the  algo¬ 
rithm  PMS  in  SIMD  implementation  we  have  the  time 

2  r  2 
0(re  +  m(e  log  m  +  £  e.)) 

i=l  1 

For  the  algorithm  PMM  in  MIMD  implementation  we  have  the  time 

O(log  r  log  e  +  (log  em) (max  log  ei)) 

i 

A  comparison  of  the  time  needed  for  the  SIMD  algorithms  here 
and  in  Section  4  shows  that  modular  arithmetic  is  to  be  pre¬ 
ferred  when  e  is  large  (high  precision)  and  m  is  greater  than 
r.  The  comparison  of  the  needed  time  for  the  MIMD  algorithms 
here  and  in  Section  5  also  indicates  that  modular  arithmetic 
is  to  be  preferred  when  m  is  greater  than  max  re^. 


7 .  Concluding  remarks 

We  have  presented  several  fast  parallel  matrix  multipli¬ 
cation  algorithms  in  this  paper.  Some  of  these  algorithms  can 
be  speeded  up  further  when  the  computations  involve  certain 
kinds  of  structured  matrices.  Such  structured  matrices,  as  we 
know,  arise  in  practical  problems  involving  various  kinds  of 
transformations  needed  in  matrix  computations,  e.g.  elementary, 
orthogonal  transformations.  This  will  be  a  fruitful  area  for 
further  study. 
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